Blog Post 5

Amazon Review analysis

Author

Mani Shanker Kamarapu

Published

November 16, 2022

Introduction

In the last post, I tidied the data further and analyzed it using visualizations. In this post I plan to do sentiment analysis and compare different lexicons.

Loading the libraries

Code

library(polite)
library(rvest)

Warning: package 'rvest' was built under R version 4.2.2

Code

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.2.2

Code

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
✔ purrr   0.3.5      
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks plotly::filter(), stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()

Code

library(SnowballC)
library(stringr)
library(quanteda)

Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.

Code

library(tidyr)
library(reshape2)


Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths

Code

library(RColorBrewer)
library(tidytext)
library(quanteda.textplots)
library(wordcloud)
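# textdata supplies the NRC lexicon used by get_sentiments("nrc");
# install it with install.packages("textdata") if it is missing
library(textdata)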

Error in library(textdata): there is no package called 'textdata'

Code

library(gridExtra)


Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

Code

library(wordcloud2)
library(devtools)

Loading required package: usethis

Code

library(quanteda.dictionaries)
library(quanteda.sentiment)


Attaching package: 'quanteda.sentiment'

The following object is masked from 'package:quanteda':

    data_dictionary_LSD2015

Code

knitr::opts_chunk$set(echo = TRUE)

Reading the data

Code

reviews <- read_csv("amazonreview.csv")

New names:
• `` -> `...1`
Rows: 46450 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): review_title, review_text, review_star, ASIN
dbl (2): ...1, page

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

reviews

Pre-processing function

Code

clean_text <- function (text) {
  # Remove URLs
  str_remove_all(text," ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>% 
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>% 
    # Replace "&" character reference with "and"
    str_replace_all("&amp;", "and") %>%
    # Remove punctuation
    str_remove_all("[[:punct:]]") %>%
    # remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline characters with a space
    str_replace_all("[\r\n]", " ") %>%
    # remove strings like "<U+0001F9F5>"
    str_remove_all("<.*?>") %>% 
    # Make everything lowercase
    str_to_lower() %>%
    # Trim leading and trailing white space and collapse repeated spaces inside the string
    str_squish()
}
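
To see what these steps do to a single review, here is a minimal sketch that applies the same kind of stringr calls one at a time to a made-up string. The sample text and the name `sample_review` are assumptions for illustration, and the URL step is left out. Note how removing punctuation turns "didn't" into "didnt", which is why such contracted forms are added to the stopword list during tokenization.

```r
library(stringr)

# Made-up review text, used only for illustration
sample_review <- "I didn't LOVE it &amp; won't read it again!!\nSee @amazon, 2 stars"

step <- str_remove_all(sample_review, "@[[:alnum:]_]*")  # drop mentions
step <- str_replace_all(step, "&amp;", "and")            # "&amp;" -> "and"
step <- str_remove_all(step, "[[:punct:]]")              # drop punctuation
step <- str_remove_all(step, "[[:digit:]]")              # drop digits
step <- str_replace_all(step, "[\r\n]", " ")             # newlines -> spaces
cleaned <- str_squish(str_to_lower(step))                # lowercase, tidy spaces

cleaned
```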

Tidying the data

Code

reviews$clean_text <- clean_text(reviews$review_text) 
reviews <- reviews %>%
  drop_na(clean_text)
reviews

Removing unnecessary columns

Code

reviews <- reviews %>%
  select(-c(...1, page, review_text))
reviews

Pre-processing the title variable

Code

reviews$review_title <- reviews$review_title %>%
  str_remove_all("\n")
reviews

Converting star of reviews from character to numeric

Code

reviews$review_star <- substr(reviews$review_star, 1, 3) %>%
  as.numeric()
reviews
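
The conversion keeps the first three characters of the rating string and parses them as a number. A minimal sketch, assuming the scraped ratings look like "4.0 out of 5 stars" (the post does not show the raw values, so the sample strings below are an assumption):

```r
# Hypothetical raw rating strings, as they might come from the scraper
raw_stars <- c("5.0 out of 5 stars", "3.0 out of 5 stars", NA)

# Keep the leading "x.y" and convert it to a number; NA stays NA
stars <- as.numeric(substr(raw_stars, 1, 3))
stars
```

If the raw strings had a different layout, the fixed positions 1 to 3 would need to be adjusted.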

Adding new variable book title to the reviews

Code

reviews <- reviews %>%
  mutate(book_title = case_when(ASIN == "B0001DBI1Q" ~ "A Game of Thrones: A Song of Ice and Fire, Book 1", 
                                ASIN == "B0001MC01Y" ~ "A Clash of Kings: A Song of Ice and Fire, Book 2", 
                                ASIN == "B00026WUZU" ~ "A Storm of Swords: A Song of Ice and Fire, Book 3", 
                                ASIN == "B07ZN4WM13" ~ "A Feast for Crows: A Song of Ice and Fire, Book 4", 
                                ASIN == "B005C7QVUE" ~ "A Dance with Dragons: A Song of Ice and Fire, Book 5", 
                                ASIN == "B000BO2D64" ~ "Twilight: The Twilight Saga, Book 1", 
                                ASIN == "B000I2JFQU" ~ "New Moon: The Twilight Saga, Book 2", 
                                ASIN == "B000UW50LW" ~ "Eclipse: The Twilight Saga, Book 3", 
                                ASIN == "B001FD6RLM" ~ "Breaking Dawn: The Twilight Saga, Book 4", 
                                ASIN == "B07HHJ7669" ~ "The Hunger Games", 
                                ASIN == "B07T6BQV2L" ~ "Catching Fire: The Hunger Games", 
                                ASIN == "B07T43YYRY" ~ "Mockingjay: The Hunger Games, Book 3"))
reviews

Adding new variable series title to the reviews

Code

reviews <- reviews %>%
  mutate(series_title = case_when(ASIN == "B0001DBI1Q" ~ "A Song of Ice and Fire", 
                                ASIN == "B0001MC01Y" ~ "A Song of Ice and Fire", 
                                ASIN == "B00026WUZU" ~ "A Song of Ice and Fire", 
                                ASIN == "B07ZN4WM13" ~ "A Song of Ice and Fire", 
                                ASIN == "B005C7QVUE" ~ "A Song of Ice and Fire", 
                                ASIN == "B000BO2D64" ~ "The Twilight Saga", 
                                ASIN == "B000I2JFQU" ~ "The Twilight Saga", 
                                ASIN == "B000UW50LW" ~ "The Twilight Saga", 
                                ASIN == "B001FD6RLM" ~ "The Twilight Saga", 
                                ASIN == "B07HHJ7669" ~ "The Hunger Games", 
                                ASIN == "B07T6BQV2L" ~ "The Hunger Games", 
                                ASIN == "B07T43YYRY" ~ "The Hunger Games"))
reviews
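
Because several ASINs map to the same series, the mapping can also be kept in a small lookup table. The following is a sketch of that alternative in base R, using the ASINs listed in this post; the names `series_lookup` and `asins` are made up for illustration. An ASIN that is not in the table yields NA, just as an unmatched case_when() does.

```r
# ASIN -> series lookup, built from the ASINs used in this post
series_lookup <- c(
  "B0001DBI1Q" = "A Song of Ice and Fire",
  "B0001MC01Y" = "A Song of Ice and Fire",
  "B00026WUZU" = "A Song of Ice and Fire",
  "B07ZN4WM13" = "A Song of Ice and Fire",
  "B005C7QVUE" = "A Song of Ice and Fire",
  "B000BO2D64" = "The Twilight Saga",
  "B000I2JFQU" = "The Twilight Saga",
  "B000UW50LW" = "The Twilight Saga",
  "B001FD6RLM" = "The Twilight Saga",
  "B07HHJ7669" = "The Hunger Games",
  "B07T6BQV2L" = "The Hunger Games",
  "B07T43YYRY" = "The Hunger Games"
)

# Sample input: three known ASINs and one unknown (hypothetical) value
asins <- c("B000BO2D64", "B07T43YYRY", "B0001DBI1Q", "UNKNOWN")

# Index the named vector by ASIN; unknown keys become NA
series <- unname(series_lookup[asins])
series
```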

Sentiment analysis

A variety of dictionaries exist for evaluating the opinion or emotion in text. In this project we focus on two of these lexicons and compare them. The two lexicons are:

bing
nrc

Both of these lexicons are based on unigrams (single words). Each contains many English words labelled for positive/negative sentiment and, in the case of nrc, also for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. tidytext provides the function get_sentiments() to retrieve a specific lexicon with only the columns that lexicon uses; the bing lexicon ships with tidytext, while the nrc lexicon is downloaded through the textdata package.
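
The joins used later in this post keep only the words that appear in the chosen lexicon and attach their label. Here is a minimal sketch of that idea with a hand-made, bing-style mini lexicon. The lexicon entries and word counts are made up for illustration; the real lexicon comes from get_sentiments("bing"), and base R's merge() stands in for inner_join() so the example runs without extra packages.

```r
# Made-up word counts, in the same shape as the word-count tables in this post
word_counts <- data.frame(
  word = c("love", "boring", "book", "great"),
  Frequency = c(120, 15, 300, 80)
)

# Hand-made stand-in for a bing-style lexicon (hypothetical entries)
toy_lexicon <- data.frame(
  word = c("love", "great", "boring"),
  sentiment = c("positive", "positive", "negative")
)

# Inner join: words without a lexicon entry ("book") are dropped
scored <- merge(word_counts, toy_lexicon, by = "word")
scored
```

A frequent but neutral word such as "book" never reaches the sentiment plots, because it has no entry in the lexicon.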

Code

reviews1 <- reviews %>%
  filter(series_title == "A Song of Ice and Fire")
reviews2 <- reviews %>%
  filter(series_title == "The Twilight Saga")
reviews3 <- reviews %>%
  filter(series_title == "The Hunger Games")

Tokenization of data

Code

# Converting the text into corpus
text_corpus1 <- corpus(c(reviews1$clean_text))
# Converting the text into tokens
text_token1 <- tokens(text_corpus1, remove_punct=TRUE, remove_numbers = TRUE, remove_separators = TRUE, remove_symbols = TRUE) %>% 
  tokens_select(pattern=c(stopwords("en"), "im", "didnt", "couldnt","wasnt", "id", "ive", "isnt", "dont", "wont", "shes", "doesnt"), selection="remove") %>%
  tokens_select(pattern=stopwords(source = "smart"), 
                selection="remove")

Code

# Converting tokens into Document feature matrix
text_dfm1 <- dfm(text_token1)
text_dfm1

Document-feature matrix of: 19,999 documents, 35,832 features (99.93% sparse) and 0 docvars.
       features
docs    love fantasy kid stories set creative worlds featuring varied groups
  text1    1      12   1       3   1        1      2         1      1      1
  text2    2       8   0       1   1        0      0         0      1      0
  text3    1       1   0       0   0        0      0         0      0      0
  text4    0       7   0       1   0        0      0         0      0      0
  text5    0      12   0       0   0        0      0         0      0      0
  text6    0       0   0       1   0        0      0         0      0      0
[ reached max_ndoc ... 19,993 more documents, reached max_nfeat ... 35,822 more features ]
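
The sparsity reported in the header of a document-feature matrix is the share of its cells that are zero. A small worked example on a made-up 3 x 4 count matrix (the matrix and its values are assumptions for illustration, not taken from the review data):

```r
# Made-up counts: rows are documents, columns are features
toy_counts <- matrix(
  c(1, 0, 0, 2,
    0, 0, 3, 0,
    0, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    docs = c("text1", "text2", "text3"),
    features = c("love", "fantasy", "kid", "stories")
  )
)

# Share of zero cells: 8 of the 12 cells are zero here
sparsity <- mean(toy_counts == 0)
sparsity
```

With tens of thousands of features and short reviews, almost every cell is zero, which is why the real matrices are more than 99% sparse.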

Code

# Converting the text into corpus
text_corpus2 <- corpus(c(reviews2$clean_text))
# Converting the text into tokens
text_token2 <- tokens(text_corpus2, remove_punct=TRUE, remove_numbers = TRUE, remove_separators = TRUE, remove_symbols = TRUE) %>% 
  tokens_select(pattern=c(stopwords("en"), "im", "didnt", "couldnt","wasnt", "id", "ive", "isnt", "dont", "wont", "shes", "doesnt"), selection="remove") %>%
  tokens_select(pattern=stopwords(source = "smart"), 
                selection="remove")

Code

# Converting tokens into Document feature matrix
text_dfm2 <- dfm(text_token2)
text_dfm2

Document-feature matrix of: 14,449 documents, 41,179 features (99.91% sparse) and 0 docvars.
       features
docs    working professional mother time reading literary snob find stick classics
  text1       1            1      2    7       5        2    1    3     1        2
  text2       0            0      1    5       1        0    0    3     0        0
  text3       0            0      0    0       2        0    0    2     0        0
  text4       1            0      3    0       1        1    0    7     1        0
  text5       1            0      3    0       0        0    0    1     0        0
  text6       0            0      0    1       2        0    0    0     0        0
[ reached max_ndoc ... 14,443 more documents, reached max_nfeat ... 41,169 more features ]

Code

# Converting the text into corpus
text_corpus3 <- corpus(c(reviews3$clean_text))
# Converting the text into tokens
text_token3 <- tokens(text_corpus3, remove_punct=TRUE, remove_numbers = TRUE, remove_separators = TRUE, remove_symbols = TRUE) %>% 
  tokens_select(pattern=c(stopwords("en"), "im", "didnt", "couldnt","wasnt", "id", "ive", "isnt", "dont", "wont", "shes", "doesnt"), selection="remove") %>%
  tokens_select(pattern=stopwords(source = "smart"), 
                selection="remove")

Code

# Converting tokens into Document feature matrix
text_dfm3 <- dfm(text_token3)
text_dfm3

Document-feature matrix of: 11,999 documents, 29,812 features (99.89% sparse) and 0 docvars.
       features
docs    began book amount trepidation popular target audience older readers tooand
  text1     1    9      1           1       1      1        1     1       3      1
  text2     0    0      1           0       0      0        0     0       0      0
  text3     0   15      0           0       0      0        1     0       0      0
  text4     0   11      0           0       0      0        0     0       0      0
  text5     0    7      0           0       0      1        0     0       0      0
  text6     0    6      0           0       0      0        0     0       1      0
[ reached max_ndoc ... 11,993 more documents, reached max_nfeat ... 29,802 more features ]

Code

textplot_wordcloud(text_dfm1, min_size = 1.5, max_size = 4, random_order = TRUE, max_words = 150, min_count = 50, color = brewer.pal(8, "Dark2") )

Code

textplot_wordcloud(text_dfm2, min_size = 1.2, max_size = 3.5, random_order = TRUE, max_words = 150, min_count = 50, color = brewer.pal(8, "Dark2") )

Code

textplot_wordcloud(text_dfm3, min_size = 1.5, max_size = 4, random_order = TRUE, max_words = 150, min_count = 50, color = brewer.pal(8, "Dark2") )

Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, : book
could not be fit on page. It will not be plotted.

Code

# Total frequency of each feature, sorted from most to least frequent
word_counts1 <- as.data.frame(sort(colSums(text_dfm1), decreasing = TRUE))
colnames(word_counts1) <- "Frequency"
word_counts1$word <- row.names(word_counts1)
word_counts1$Rank <- seq_len(nrow(word_counts1))
word_counts2 <- as.data.frame(sort(colSums(text_dfm2), decreasing = TRUE))
colnames(word_counts2) <- "Frequency"
word_counts2$word <- row.names(word_counts2)
word_counts2$Rank <- seq_len(nrow(word_counts2))
word_counts3 <- as.data.frame(sort(colSums(text_dfm3), decreasing = TRUE))
colnames(word_counts3) <- "Frequency"
word_counts3$word <- row.names(word_counts3)
word_counts3$Rank <- seq_len(nrow(word_counts3))

Code

Sentiment1_bing <- word_counts1 %>%
 inner_join(get_sentiments("bing"), by = "word")
Sentiment1_nrc <- word_counts1 %>%
  inner_join(get_sentiments("nrc"), by = "word")

Error: The textdata package is required to download the NRC word-emotion association lexicon.
Install the textdata package to access this dataset.

Code

Sentiment2_bing <- word_counts2 %>%
 inner_join(get_sentiments("bing"), by = "word")
Sentiment2_nrc <- word_counts2 %>%
  inner_join(get_sentiments("nrc"), by = "word")

Error: The textdata package is required to download the NRC word-emotion association lexicon.
Install the textdata package to access this dataset.

Code

Sentiment3_bing <- word_counts3 %>%
 inner_join(get_sentiments("bing"), by = "word")
Sentiment3_nrc <- word_counts3 %>%
  inner_join(get_sentiments("nrc"), by = "word")

Error: The textdata package is required to download the NRC word-emotion association lexicon.
Install the textdata package to access this dataset.

Code

p1 <- Sentiment1_bing %>% 
  group_by(sentiment) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = perc, fill = as.factor(sentiment))) +
  ggtitle("Positive vs Negative count") +
  geom_col(color = "black") +
  geom_label(aes(label = labels), color = c(1, "white"),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  guides(fill = guide_legend(title = "Sentiment")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void() 
p2 <- Sentiment1_nrc %>% 
  group_by(sentiment) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = perc, fill = as.factor(sentiment))) +
  ggtitle("Emotions count") +
  geom_col(color = "black") +
  geom_label(aes(label = labels), color = c(1, "white", "white", "white", "white", "white", "white", "white", "white", "white"),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  guides(fill = guide_legend(title = "Sentiment")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void()

Error in group_by(., sentiment): object 'Sentiment1_nrc' not found

Code

grid.arrange(arrangeGrob(p1, p2, ncol = 2),
             nrow = 1)

Error in arrangeGrob(p1, p2, ncol = 2): object 'p2' not found
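
The pie charts in this section are built by counting how many distinct matched words fall under each sentiment label and converting those counts to shares. Here is a minimal sketch of that calculation on made-up data (the words and frequencies are assumptions). A frequency-weighted share is shown alongside for comparison, since the two can differ.

```r
# Made-up lexicon matches with their corpus frequencies
matches <- data.frame(
  word = c("love", "great", "boring", "dark"),
  sentiment = c("positive", "positive", "negative", "negative"),
  Frequency = c(120, 80, 15, 25)
)

# Share of distinct words per sentiment (the n / sum(n) step in the plots)
word_share <- prop.table(table(matches$sentiment))

# Share weighted by how often each word occurs in the text
freq_share <- prop.table(tapply(matches$Frequency, matches$sentiment, sum))

word_share
freq_share
```

In this toy data the labels split evenly by word count, while the weighted share leans positive because the positive words occur far more often.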

Code

p1 <- Sentiment2_bing %>% 
  group_by(sentiment) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = perc, fill = as.factor(sentiment))) +
  ggtitle("Positive vs Negative count") +
  geom_col(color = "black") +
  geom_label(aes(label = labels), color = c(1, "white"),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  guides(fill = guide_legend(title = "Sentiment")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void()
p2 <- Sentiment2_nrc %>% 
  group_by(sentiment) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = perc, fill = as.factor(sentiment))) +
  ggtitle("Emotions count") +
  geom_col(color = "black") +
  geom_label(aes(label = labels), color = c(1, "white", "white", "white", "white", "white", "white", "white", "white", "white"),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  guides(fill = guide_legend(title = "Sentiment")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void()

Error in group_by(., sentiment): object 'Sentiment2_nrc' not found

Code

grid.arrange(arrangeGrob(p1, p2, ncol = 2),
             nrow = 1)

Error in arrangeGrob(p1, p2, ncol = 2): object 'p2' not found

Code

p1 <- Sentiment3_bing %>% 
  group_by(sentiment) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = perc, fill = as.factor(sentiment))) +
  ggtitle("Positive vs Negative count") +
  geom_col(color = "black") +
  geom_label(aes(label = labels), color = c(1, "white"),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  guides(fill = guide_legend(title = "Sentiment")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void()
p2 <- Sentiment3_nrc %>% 
  group_by(sentiment) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc)) %>%
  ggplot(aes(x = "", y = perc, fill = as.factor(sentiment))) +
  ggtitle("Emotions count") +
  geom_col(color = "black") +
  geom_label(aes(label = labels), color = c(1, "white", "white", "white", "white", "white", "white", "white", "white", "white"),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  guides(fill = guide_legend(title = "Sentiment")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void()

Error in group_by(., sentiment): object 'Sentiment3_nrc' not found

Code

grid.arrange(arrangeGrob(p1, p2, ncol = 2),
             nrow = 1)

Error in arrangeGrob(p1, p2, ncol = 2): object 'p2' not found

Code

p1 <- Sentiment1_bing %>%
 filter(Frequency > 600) %>%
 mutate(Frequency = ifelse(sentiment == "negative", -Frequency, Frequency)) %>%
 mutate(word = reorder(word, Frequency)) %>%
 ggplot(aes(word, Frequency, fill = sentiment))+
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")
p2 <- Sentiment2_bing %>%
 filter(Frequency > 700) %>%
 mutate(Frequency = ifelse(sentiment == "negative", -Frequency, Frequency)) %>%
 mutate(word = reorder(word, Frequency)) %>%
 ggplot(aes(word, Frequency, fill = sentiment))+
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")
p3 <- Sentiment3_bing %>%
 filter(Frequency > 500) %>%
 mutate(Frequency = ifelse(sentiment == "negative", -Frequency, Frequency)) %>%
 mutate(word = reorder(word, Frequency)) %>%
 ggplot(aes(word, Frequency, fill = sentiment))+
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")
grid.arrange(arrangeGrob(p1, p2, ncol = 2),
             p3,
             nrow = 2)

Code

Sentiment1_nrc %>% 
 filter(Frequency > 500) %>%
 mutate(word = reorder(word, Frequency)) %>%
 ggplot(aes(word, Frequency))+
 facet_wrap(~ sentiment, scales = "free", nrow = 5) +
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")

Error in filter(., Frequency > 500): object 'Sentiment1_nrc' not found

Code

Sentiment2_nrc %>% 
 filter(Frequency > 600) %>%
 mutate(word = reorder(word, Frequency)) %>%
 ggplot(aes(word, Frequency))+
 facet_wrap(~ sentiment, scales = "free", nrow = 5) +
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")

Error in filter(., Frequency > 600): object 'Sentiment2_nrc' not found

Code

Sentiment3_nrc %>% 
 filter(Frequency > 500) %>%
 mutate(word = reorder(word, Frequency)) %>%
 ggplot(aes(word, Frequency))+
 facet_wrap(~ sentiment, scales = "free", nrow = 5) +
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")

Error in filter(., Frequency > 500): object 'Sentiment3_nrc' not found

Code

Sentiment1_bing %>%
acast(word ~ sentiment, value.var = "Frequency", fill = 0) %>%
 comparison.cloud(colors = c("red", "dark green"),
          max.words = 100)

Code

Sentiment2_bing %>%
acast(word ~ sentiment, value.var = "Frequency", fill = 0) %>%
 comparison.cloud(colors = c("red", "dark green"),
          max.words = 150)

Code

Sentiment3_bing %>%
acast(word ~ sentiment, value.var = "Frequency", fill = 0) %>%
 comparison.cloud(colors = c("red", "dark green"),
          max.words = 100)
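
Before plotting, the comparison clouds reshape the long table of matches into a word-by-sentiment matrix, with 0 where a word does not carry a label. The sketch below shows that layout on made-up data. Base R's xtabs() is swapped in for acast() so the example runs without reshape2, and the words and frequencies are assumptions.

```r
# Made-up lexicon matches in long format: one row per word
matches <- data.frame(
  word = c("love", "great", "boring"),
  sentiment = c("positive", "positive", "negative"),
  Frequency = c(120, 80, 15)
)

# Wide layout: one row per word, one column per sentiment, 0 where absent
wide <- xtabs(Frequency ~ word + sentiment, data = matches)
wide
```

Here the columns come out in alphabetical order (negative, then positive). That matches the order of the two colours passed to comparison.cloud() in the post, assuming acast() orders its columns the same way.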

Further study

In the next post, I will apply topic modelling to the reviews and analyse the results.